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Abstract: Distance measures between trees are useful for comparing trees in a systematic 
manner, and several different distance measures have been proposed. The triplet and quartet 
distances, for rooted and unrooted trees, respectively, are defined as the number of subsets 
of three or four leaves, respectively, where the topologies of the induced subtrees differ. 
These distances can trivially be computed by explicitly enumerating all sets of three or four 
leaves and testing if the topologies are different, but this leads to time complexities at least 
of the order or just for enumerating the sets. The different topologies can be counted 
implicitly, however, and in this paper, we review a series of algorithmic improvements that 
have been used during the last decade to develop more efficient algorithms by exploiting 
two different strategies for this; one based on dynamic programming and another based on 
coloring leaves in one tree and updating a hierarchical decomposition of the other. 

Keywords: algorithmic development; tree comparison; triplet distance; quartet distance 
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1. Introduction 

Evolutionary relationships are often represented as trees; a practice not only used in biology, where 
trees can represent, for example, species relationships or gene relationships in a gene family, but also 
used in many other fields studying objects related in some evolutionary fashion. Examples include 
linguistics, where trees represent the evolution of related languages, or archeology, where trees have 
been used to represent how copies of ancient manuscripts have changed over time. Common for most 
such fields is that the true tree relationship between objects is never observed, but must be inferred, and 
depending on both the data available and the methods used for the inference, the inferred trees will likely 
be slightly different. 

Tree distances provide a formal way of quantifying how similar two trees are and, for example, 
determine if two trees are significantly similar or no more similar than could be expected by chance. 
Many different distances have been defined on trees. Most only consider the tree topology, i.e., how 
nodes are related, but not how branch lengths might differ between the trees, and most of these only 
consider relationships between labeled leaves and consider inner nodes as unlabeled. Commonly used 
distance measures are the Robinson-Foulds distance [1] and the quartet distance [2] for unrooted trees 
and the triplet distance [3] for rooted trees. All three are based on the idea of enumerating all sub- 
structures of the two trees' topology and counting how often the structure is the same in the two trees 
and how often it is different. 

The Robinson-Foulds distance considers all the ways the leaf labels can be split into two sets and 
counts how often only one of the trees has an edge matching this split. Informally, this essentially means 
that it counts how often the two trees have the "same edge" and how often an edge in one tree has no 
counterpart in the other. Edges are arguably the simplest element of the topology of a tree, and not 
surprisingly, the Robinson-Foulds distance is both the most frequently used distance measure and the 
distance measure that can be computed with the optimal algorithmic complexity, 0{n), for trees with n 
leaves [4]. 

The triplet and quartet distances enumerate all subsets of three or four leaves, respectively, and count 
how often the topologies induced by the three or four leaves are the same in the two trees. The triplet 
distance is intended for rooted trees, where the triplet topology is the smallest informative subtree (for 
unrooted trees, all subtrees with three leaves have the same topology), while the quartet distance is 
intended for unrooted trees, where the quartet topology is the smallest informative subtree. Whether 
it is possible to compute the triplet and quartet distance in linear time is unknown. The fastest known 
algorithms have time complexity 0{n log n) for both distances for binary trees, 0{n log n) for the triplet 
distance and 0{dn log n) for the quartet distance for general trees, where d is the largest degree of a node 
in the trees [5]. 

In this paper, we will review the algorithmic development that led to these non-trivial running times, 
in particular, the development of algorithms for general (non-binary) trees to which the authors have 
contributed a number of papers. We will first formally define the triplet and quartet distance between 
two leaf-labeled trees. We then describe the state-of-the-art for binary trees with two different approaches 
to computing the distances: one based on dynamic programming with time complexity 0{v?) for 
both quartet and triplet distance and one based on coloring leaves in a tree traversal with complexity 
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O(nlogn) for quartet distances (with very little work, this algorithm can be modified to compute the 
triplet distance, but it has never been described as such in the literature). The dynamic programming 
approach was the first algorithm that was modified to deal with the quartet distance for general trees, 
and in the following section, we describe a number of approaches to doing this. In the same section, 
we describe how the tree coloring approach can be adapted to general trees, leading to the currently 
best worst-case running times mentioned above. We then turn to practical implementations, present 
experimental results for the best algorithms and show that the theoretically fastest algorithms can also 
be implemented to run efficiently in practice. 

2. The Triplet and Quartet Distances 

Given two trees, Ti and T2, each with n leaves with labels from the same set of n labels, the triplet 
distance is defined for rooted trees, while the quartet distance is defined for unrooted trees. 

A triplet is a set, A;}, of three leaf labels. This is the smallest number of leaves for which 
the subtree induced by these leaves can have different topologies in two rooted trees. The possible 
topological relationships between leaves i, j and k are shown in Figure la. The last case is not possible 
for binary trees, but it is for trees of arbitrary degrees. The triplet distance is defined as the number of 
triplets whose topology differs in the two trees. It can naively be computed by enumerating all (g) sets 
of three leafs and for each set comparing the induced topologies in the two trees. 

A quartet is a set, {z, j, k, i}, of four leaf labels. This is the smallest number of leaves for which 
the subtree induced by these leaves can have different topologies in two unrooted trees. The possible 
topologies are shown in Figure lb. The last case is not possible for binary trees (trees with inner nodes 
of a degree of three), but it is for trees of arbitrary degrees. For the remaining three cases, the set, 
{i, j, k, i}, is split into two sets of two leaves by the middle edge. We sometimes use the notation, ij \ ki, 
for splits, with this particular instance referring to the leftmost topology in Figure lb. Similar to the 
triplet distance, the quartet distance is defined as the number of quartets whose topology differ in the two 
trees. It can naively be computed by enumerating all (2) sets of four leafs and for each set comparing 
the induced topologies in the two trees. 



Figure 1. Different cases for triplet and quartet topologies. 



.^'^ ^^fe '>^! 
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(a) Triplet topologies. (b) Quartet topologies 



The rightmost cases for triplets and quartets in Figure la,b are called unresolved, while the remaining 
are called resolved. Both for triplets and quartets, the possible topologies in two trees, Ti and T2, 
of arbitrary degree can be partitioned into the five cases, A-E, listed in Figure 2. Note that in the 
resolved-resolved case, the triplet/quartet topologies may either agree {A) or differ {B) in the two trees. 
In the resolved-unresolved and unresolved-resolved cases (C and D), they always differ, and in the 
unresolved-unresolved case {E), they always agree. For binary trees, only the two resolved-resolved 
cases are possible. 
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Figure 2. Cases for computing differences. 
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Since both distances are defined as the number of differing topologies, they can be computed as 
B + C + D. However, none of our algorithms compute B, C and D explicitly. Different algorithms use 
different relationships between the five counters, A-E, to compute the distances faster. 

The quartet and triplet distances are known to be more robust to small changes in the trees than other 
distance measures, including the Robinson-Foulds distance [6]. They also have a much larger range than 
many other distance measures for trees and are, therefore, regarded as more fine-grained. The maximal 
value of the normalized quartet distance, (i? + C + -D)/(^), obtained by dividing the distance by the 
total number of quartets was shown by Bandelt and Dress to be monotonically increasing with n, and 
they conjectured that this ratio is bounded by 2/3 [7]. Steel and Penny furthermore showed that the 
normalized mean value of the distances, E,(B + C + -D)/(^), tends to 2/3 as n oo and that the 
variance of {B + C + D) / (^) tends to zero as n — i- oo, when all (binary or general) trees are sampled 
with equal probability [6]. A further desirable feature of the quartet and triplet distances is that quartets 
and triplets play a fundamental role in many approaches to phylogenetic tree reconstruction (see, e.g., 
the R* method in Dendroscope [8] or the Quartet MaxCut method [9]). 

The parameterized triplet and quartet distance is defined by Bansal et al. [10] as B + p ■ {C + D), for 
some 0 < p < 1, and makes it possible to weigh the contribution of unresolved triplets/quartets to the 
total triplet/quartet distance. Forp = 0, unresolved triplets/quartets do not contribute to the distance, i.e., 
unresolved triplets/quartets are ignored, while for p = 1, they contribute fully to the distance, as is the 
case in the unparameterized triplet/quartet distance. Bansal et al. recently showed that the parameterized 
triplet and quartet distances are: i) metrics if p E [1/2, 1]; ii) distance measures, but not metrics (the 
triangle-inequality is violated) if p G (0, 1/2); and iii) not distance measures if p = 0 (two non-equivalent 
trees can have a "distance" of zero) [10]. Whether unresolved topologies should contribute to the distance 
depends on whether one considers an unresolved node a statement of knowledge of a multifurcation or 
simply a lack of knowledge of the true topology. See, e.g., Pompei et al. [11] and Walker et al. [12] for 
a discussion of dealing with unresolved topologies as a lack of knowledge in linguistics. 

3. Computing the Triplet and Quartet Distances between Binary Trees 

In this section, we consider the binary case, i.e., where all topologies are resolved and C, D and E 
from Figure 2 are zero. In this case, the distances are the B counter, which can be obtained by computing 
either A or B, since B = (2) — A for the triplet distance and B = (2) — A for the quartet distance. 

We will describe two algorithms for computing the quartet distance for binary trees, one based 
on dynamic programming and the other on coloring leaves and comparing topologies induced by the 
coloring. These are also the two approaches that we have used to handle general trees, and we will 
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describe the extensions for general trees in the next section. Variations of the two approaches to compute 
the triplet distance have also been developed, but in our main exposition, we focus on the quartet distance. 

3.1. A Dynamic Programming Approach 

The first algorithm, from Bryant et al. [13], is based on two ideas: orienting edges and associating 
directed edges to quartets and, then, building tables of the number of leaves shared between two subtrees. 

We first conceptually take all edges in the two trees and replace them with two oriented edges. This 
way, we can uniquely assign three subtrees to each oriented edge: Fi behind the edge and F2 and F3 in 
front of the edge (see Figure 3a). We will write this as Fi A (F2, F3) and say that the directed edge, 
Fi A (F2, F3), claims all quartets, ij\M, with i, j e Fi, k e F2 and i e F3 (or £ e F2 and k G F3). If 
a tree contains the quartet topology, ij\ki, it will be claimed by exactly two such edges: Fi A (F2, F3) 
with i,j e Fi,k e F2 and £ e F^ {or £ e F2 and k e F3); and Gi -4 {G2, G3) with k,i e Gi,i E G2 
and j E G3 (or j E G2 and i E G3) as in Figure 3b. 



Figure 3. Any quartet is claimed by exactly two edges. 




(a) Fi A Fa) (b) Gi A (G2, G3) 



We will consider these two oriented quartets, ij — > k^ and ij 4— k£, where e claims ij — )■ ki and 
e' claims ij ki. Any shared quartet topology, ij\ki, corresponds to exactly two oriented quartet 
topologies, so we can compute A by counting the number of shared oriented quartet topologies and 
divide by two. 

Given two trees, Ti and T2, the algorithm now simply iterates through all edges, Fi A- {F2, -F3) in Ti 
and Gi A- {G2, G^) in T2, and counts how many oriented quartets, ij — )■ ki, are claimed by both ei and 
62- This number, denoted A(ei, 62), can be computed as: 

A(ei, 62) = (^^' 2 ^'') (1^2 n G2I ■ IF3 n G3I + IF2 n G3I ■ IF3 n G2I) 

where |F n G| denotes the number of leaves in two subtrees, F and G, with the same labels. This 
expression can be computed in constant time if we have tables containing the count, \F fl G|, for all 
subtrees, F of Ti and G of T2. The number of shared quartets (and by extension, the quartet distance) 
can then be computed as: 

ei6Ti e2eT2 

in time O(ri^). 

Tables for |F fl G| can be computed recursively in time O(n^): if F contains subtrees, Fi and F2, and 
G contains subtrees, Gi and G2, then |F fi G| = |Fi fl Gi| + |Fi n G2I + IF2 fl Gi| + IF2 n G2I. If F or 
G (or both) are leaves, similar, but simpler, cases are used. 
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By modifying this algorithm slightly, it can also be used to compute the triplet distance between 
rooted binary trees in 0{n^) time. 

3.2. Tree Coloring 

Brodal et a/.'s 0{n log^ n) and 0{n \ogn) algorithms [14,15] take a different approach than dynamic 
programming, but they are also based on oriented quartets. Rather than associating oriented quartets 
to edges, however, the algorithms associate oriented quartets to inner nodes. Given an inner node, t>, 
with three incident subtrees, F, G and H, we say that v claims the oriented quartets, ij — )■ ki, where 
i and j are in two different subtrees of v and k and £ both are in the remaining subtree. Node v thus 
claims ('^ ') • \G\ ■ \H\ + |F| ■ (' ^'j ■ \H\ + \F\ ■ \G\ ■ ('^') oriented quartets, and similar to before, each 
quartet is associated with exactly two inner nodes. Furthermore, similar to before, the idea is to compute 
the double sum ^ = | J2vieTi J2v2eT2 ^(^i' ^2)5 where A(t>i, V2) now denotes the number of oriented 
quartets (in the remainder of this section, we will use quartet and oriented quartet interchangeably) 
associated with nodes, vi and V2, which induce the same quartet topology in both Ti and T2. The two 
algorithms, however, only explicitly iterate over the nodes in Ti, while for each vi E Ti, they compute 
^v2eT2 ^(^1' ^2) implicitly using a data structure called the hierarchical decomposition tree of T2. To 
realize this strategy, both algorithms use a coloring procedure in which the leaves of the two trees are 
colored using the three colors. A, B and C. For an internal node, vi E Ti, we say that Ti is colored 
according to vi if the leaves in one of the subtrees all have the color. A, the leaves in another of the 
subtrees all have the color, B, and the leaves in the remaining subtree all have the color, C. Additionally, 
we say that a quartet, ij -> k£, is compatible with a coloring if i and j have different colors and k and 
i both have the remaining color. From this setup, we immediately get that if Ti is colored according to 
a node, Vi E Ti, then the set of quartets in Ti that are compatible with this coloring is exactly the set of 
quartets associated with t>i and, furthermore, if we color the leaves of T2 in the same way as in Ti, that 
the set of quartets, 5", in T2 that are compatible with this coloring is exactly the set of quartets that are 
associated with vi and induce the same topology in both Ti and T2; thus, X]i,2gT2 ^(^1,^2) = |'S'|. 

Naively coloring the leaves of the two trees according to each inner node in Ti and counting the 
compatible quartets in T2 explicitly would take time O(n^), as we would run through n — 2 different 
colorings and perform 0{n) work for each. By handling the coloring of Ti recursively and using the 
"smaller-half trick", we can ensure that each leaf changes color only O(logn) times, giving 0{n logn) 
color changes in total. To compute the number of compatible quartets in T2 for each coloring, we use a 
hierarchical decomposition of T2. The main feature of this data structure is that we can decorate it in a 
way such that it can return the number of quartets in T2 compatible with the current coloring in constant 
time. To achieve this, the hierarchical decomposition uses O(logn) time to update its decoration when a 
single leaf changes color. Thus, in total, we use 0{n log^ n) time [14]. The algorithm for computing the 
triplet distance between binary trees in time 0(nlog^ n) from [16] also uses this strategy, although the 
decoration of the hierarchical decomposition differs significantly. In the following two subsections, we 
describe the coloring procedure and the hierarchical decomposition tree data structure used in the quartet 
distance algorithm. Finally, in Section 3.2.3. , we describe how the algorithm was tweaked in [15] to 
obtain time complexity 0{n log n). 
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3.2.1. Coloring Leaves in Ti Using the "Smaller-Half Trick" 

The coloring procedure starts by rooting Ti in an arbitrary leaf. It then for each inner node, vi E Ti, 
calculates the number of leaves, \vi\, in the subtree rooted at Vi. This is done in a postorder traversal 
starting in the new (designated) root of Ti, and the information is stored in the nodes. In this traversal, 
all leaves are also colored by the color. A, and the new root is colored by the color, C. The procedure 
then runs through all colorings of Ti recursively, as described and illustrated in Figure 4, starting at the 
single child of the root of Ti. In Figure 4, large(f i), respectively small(7;i), denotes the child of vi that 
constitutes the root of the largest, respectively smallest, of the two subtrees under vi, measured on the 
number of leaves in the subtrees, and get_shared ( ) is a call to the hierarchical decomposition of T2 to 
retrieve the number of quartets in T2 that are compatible with the current coloring. During the traversal, 
the following two invariants are maintained for all nodes, Vi E Ti. (1) when Count {vi) is called, all 
leaves in the subtree rooted at vi have the color. A, and all other leaves have the color, C; and (2) when 
Count {vi) returns, all leaves have the color, C. As illustrated in Figure 4, these invariants imply that 
when get.shared ( ) is called as a subprocedure of Count (^i) , then Ti is colored according to Vi. 
Hence, the correctness of the algorithm follows, assuming that get_shared ( ) returns the number of 
oriented quartets in T2 compatible with the current coloring of the leaves. 

Figure 4. Traversing Ti using the "smaller-half trick" to ensure that each leaf changes color 
at most O(logn) times. 



Procedure Count {vi ) : 
if vi is a leaf : 

color VI C 

return 0 
else : 

ColorLeaves (smaZZ(vi) , B) 
shared = get_shared () 
ColorLeaves (smaZZ(vi) , C) 
shared += Count (Zar'(5(e(i'i) ) 
ColorLeaves (smaZi(t;i) , A) 
shared += Count (sTOan(ui) ) 
return shared 




;i) At line 1 (2) After line 6 




(3) After line 8 (4) After line 9 




(5) After line 10 (6) After line 11 



Note that the color of a specific leaf is only changed whenever it is in the smallest subtree of one of its 
ancestors. Since the size of the smallest of the two subtrees of a node, f 1, is at most half the size of the 
entire subtree rooted at vi, this can only happen at most logri times for each leaf, leading to 0{n\ogn) 
color changes in total. 
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3.2.2. The Hierarchical Decomposition Tree 

The hierarchical decomposition tree of T2, HDT{T2), is a representation of T2 as a binary tree data 
structure with logarithmic height. The nodes in HDT(T2) are called components, and each of them is 
of one of the six types illustrated in Figure 5. The nodes in T2 (including leaves) constitute the leaves of 
HDT(T2): each leaf in T2 is contained in a component of type (i), and each inner node of T2 is contained 
in a component of type (ii). An inner node/component of HDT{T2) represents a connected subset of 
nodes in T2, and it is formed as the union of two adjacent components using one of the compositions 
in Figure 5(iii)-(vi), where two components are adjacent if there is an edge in T2 connecting the two 
subsets of nodes they represent. The root of HDT{T2) is formed using the composition in Figure 5(vi), 
and it represents the entire set of all nodes (including leaves) in T2. 

Figure 5. The six different types of components in a hierarchical decomposition tree. Type 
(i)and (ii) consist of, respectively, a single leaf or an inner node in the original binary tree. 
Type (iii)-(vi) are formed as unions of two adjacent components. Type (vi) occurs only at 
the root of the hierarchical decomposition tree. 





( 



(ii) 



(iii) 



(iv) 



(v) 



(vi) 



To build HDT(T2), we first make a component of type (i) or (ii) for each node in T2 and, then, greedily 
combine and replace pairs of components using the compositions in Figure 5(iii)-(vi) in subsequent 
iterations, until we only have the single root component left. This greedy strategy requires O(logn) 
iterations using 0{n) time in total, and the result is an HDT of height 0{\ogn) (see [14,15] for details). 

To be able to retrieve the number of quartets in T2 from HDT{T2) in constant time, we decorate it 
in the following way: For each component, C, in HDT(T2), we store a tuple (a, 6, c) of integers and a 
function, F. The integers, a, h and c, denote the number of leaves with colors A, B and C, respectively, 
in the connected component of T2 represented by C. The function F takes three parameters for each 
external edge of C (all components have between zero and three external edges, depending on their type, 
e.g., type (i) components have one external edge, while type (iii) components have two). If C has k 
external edges, we number these arbitrarily from one to k and denote the three parameters for edge i by 
Oj, bi and q. Of the two ends of an external edge, one is in the component and the other is not. The 
parameters, Oj, bi and q, denote the number of leaves with colors A, B and C, respectively, which are in 
the subtree of T2 rooted at the endpoint of edge i that is not in C. Finally, F states, as a function of these 
parameters, the number of elements of the quartets, which are both associated with nodes contained in 
C and compatible with the current coloring of the leaves. It turns out that F is a polynomial of a total 
degree of at most four. 

It is beyond the scope of this paper to describe the details of how this decoration of HDT(T2) is 
initialized and updated, but the crucial point is that for a component, C, which is the composition of 
two components, C and C", the decoration of C can be computed in constant time, provided that the 
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decorations of C and C" are known. This means that the decoration can be initialized in 0{n) time in 
a postorder traversal and that it can be updated in O(logn) time when a single leaf changes color by 
updating the path from the type (i) component containing the leaf to the root of HDT{T2). 

3.2.3. Tweaking the Algorithm to Obtain Time Complexity 0{n logn) 

To further decrease the time complexity of our algorithm, we need a crucial lemma, which was 
hinted at in exercise 35 of [17] and stated as Lemma 7 in [15]. This lemma, which was named 
"the extended smaller-half trick" states that if T is a binary tree with n leaves and we have a cost of 
Ct, = |small(f)| log |gn^[n(-^-)| for each inner node v and c^, = 0 for each leaf v, then we have a total cost 
of X]i;eT Cv < n logn. This means that if we can decrease the time used per inner node vi in Ti from 
|small(f i) I log n, as in the previous two subsections, to |small(t'i) | log |snijn(|^^|-)| , the total time used will 
be reduced from 0(n log^ n) to 0{n logn). 

The first step in doing this is to use Lemma 3 from [15], which tells us that if we are given an unrooted 
tree, T, with n nodes of a degree of at most three, where k leaves have been marked as non-contractible, 
then we can contract T in 0(n) time into a contraction with at most 4A; — 5 nodes, such that each 
non-contractible leaf is a node by itself. This means that if we, prior to visiting a node, f i G Ti, mark 
all leaves in the subtree of Ti rooted at vi as non-contractible and contract T2 according to this marking, 
then we get a contraction, T2^^\ with at most 4|t>i| — 5 nodes, and thus, we can build a hierarchical 
decomposition tree, HDT{T2^^), from this contraction with height 0(log in time 0(|i;i|). Hence, 
if we use HDT{T2^'') instead of HDT{T2) when visiting vi G Ti, the time needed for visiting t>i, 
disregarding the time spent on building T2^\ is now Odsmal^vi)] log |fi|). However, this is still too 
much to use the extended smaller-half trick to obtain our goal. 

To get all the way down to 0(|small(t;i)| log |smaii\li)| )' ^^^^ observe that a hierarchical 
decomposition tree is a locally-balanced binary rooted tree. A binary rooted tree is c-locally-balanced 
if for all nodes v in the tree, the height of the subtree rooted at v is at most c(l + log |f |). To be 
exact, our hierarchical decomposition trees are 1/ log (jj) -locally -balanced (see [15] for the proof of 
this). A locally-balanced tree with n nodes has the property that the union of k different leaf-to-root 
paths contains just 0(A;log |) nodes (see Lemma 4 in [15]). Thus, since eacht;i G Ti HDTiT^"'^) has 
Odtiil) nodes, the |small(fi)| leaf-to-root paths in HDT{T2^^) that need to be updated when visiting 
t>i contain 0(|small(fi)| log i^^J;;]^^) nodes in total. This means that if we spend only constant time 
to update each of these, we get a total time for coloring and counting, still disregarding the time used 
to build contractions, of 0{n\ogn) by the extended smaller-half trick. To spend only constant time for 
each of the 0(|small(fi)| log j^^^J^j^^) nodes, we split the update procedure for HDT{T2^^) after the 
|small(f 1) I color changes in two in the following way: (1) We first mark all internal nodes in HDT [t!^^^] 
on paths from the |small(fi)| type (i) components to the root by marking bottom-up from each of the 
type (i) components, until we find the first already marked component. (2) We then update all the marked 
nodes recursively in a postorder traversal starting at the root of HDT{T2^^). 

We now use 0(n log tt.) time for the coloring and counting, but if we built Tj^^^ from scratch for 
every inner node, f 1 in Ti, we would spend O(n^) time just doing this. To fix this problem, we will 
only contract T2 whenever a constant fraction of the leaves have been colored C, and we will not do it 
from scratch every time. Figure 6 shows how this is achieved using the subroutines. Contract and 
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Extract. In this pseudo-code, T2 and are used to refer to the tree, T2, as well as their associated 
hierarchical decomposition trees. 

Figure 6. Tree coloring algorithm. 



Procedure FastCount(v, T2 ) : 
local var 
if V is a leaf : 

color V C 

return 0 
else : 

ColorLeaves ( small (v) , B, T2) 

shared = get_shared() 

T2 = Contract(B, Extract ( small (v ) 

ColorLeaves ( small ( V ) , C, T2 ) 

if IT2I > 5 ■ llarge (v)| : 

T2 = Contract T2 ) 

shared += FastCount ( large (v) 
ColorLeaves ( small ( V ) , A, T2) 
shared += FastCount ( small ( v ) 
return shared 



T2) 



T2)) 



The routine. Contract (A", T2) , constructs a decomposition of T2 of size 0(|A:'|), where each leaf 
with the color, X, has been regarded as non-contractible and where \X\i% the number of leaves with the 
color, X. This is done in 0(|T2|) time, where IT2I is the number of nodes in the current version of T2. 
Note that T2 is not static, but is a parameter to FastCount and is changed in specific recursive calls. 
See [15] for the details of Contract. The routine. Extract (small(fi), T2) , uses the hierarchical 
decomposition of T2 to extract a copy of T2 at the point in the algorithm where Ti is colored according 
to vi. The extracted tree is a copy of T2 where all leaves in the subtree rooted at small (fi) still have the 
color, B, but all other leaves have the color C. The first call to Contract in line 9 builds a contraction, 
T2, of this copy, and this contraction is used as T2 in the recursive call on small (fi) in line 15. The 
second call to Contract in line 12 is only executed when IT2I > 5|large(i;i)| {i.e., when more than 
4/5 of the leaves have the color, C). This contraction is used in the recursive call on large (fi) in line 13. 

Line 9 takes 0(|small(fi)| log (see [15]); thus, since IT2I = It'll when we perform it, the 

total time used on this line is 0{n \ogn) by the extended smaller-half trick. We now consider the time 
spent on contracting T2 in line 12. We perform this line whenever IT2I > 5|large(fi)|. Since all leaves 
in large(f 1) are colored A when we contract, the size of the new T2 has at most 4|large(f 1) | — 5 nodes. 
Hence, the size of T2 is reduced by a factor of 4/5. This implies that the sequence of contractions applied 
to a hierarchical decomposition results in a sequence of data structures of geometrically decaying sizes. 
Since a contraction takes time 0(|T2|), the total time spent on line 12 is linear in the initial size of T2, i.e., 
it is dominated by the time for constructing the initial hierarchical decomposition tree, which is 0{n). 
In total, we therefore use 0{n log n) time for contracting and extracting T2. 

4. Dealing with General Trees 

The main focus on algorithms for tree comparison has been on binary trees. This is understandable, 
since most algorithms for constructing trees will always create fully-resolved trees, even when some 
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edges have very little support from the data (but, see Buneman trees or refined Buneman trees for 
exceptions to this [18,19]). Trees that are not fully resolved do occur in studies, however, and this 
creates a problem when the algorithms for computing tree distances do not generalize to general trees. 

In our research, we have developed a number of approaches for adapting the algorithms from the 
previous section to work on general trees, and we review these approaches in this section. 

4.1. Dynamic Programming Approaches 

Problems with generalizing the dynamic programming algorithm for binary trees to general trees are 
two-fold. First, using edges to claim quartets is only meaningful for resolved quartets, and secondly, 
the resolved quartets claimed by an edge can be distributed over a large number of trees. The first 
problem can possibly be dealt with by using nodes to claim unresolved quartets, but the second problem 
is more serious. 

We can still compute the tables of the intersections of trees, |F fl Gj. If nodes can have a degree up 
to d, then in the recursion for building tables |F fl G|, we need to combine counts for 0{d^) pairs of 
trees, but the total sum of degrees for each tree is bounded by 0{n). Therefore, the product of the pairs 
of degrees is bounded by O(n^). Worse, however, is computing A(ei, 62) where we need to consider all 
ways of picking trees F2 and F3 and pairing them with choices G2 and G3, with a worst-case performance 
of 0{d^) for each pair of edges. 

In a series of papers, we have developed different algorithms for computing the quartet distance 
between general trees efficiently by avoiding explicitly having to deal with choosing pairs of trees for 
inner nodes. Common for these is that we also avoid explicitly handling unresolved quartets, but only 
consider the resolved topologies and handle the unresolved quartets implicitly. 

Christiansen et al. [20] developed two algorithms for computing the quartet distance between general 
trees. The idea in the first algorithm, which runs in time 0{n^), is to consider all triplets, {i,j, k}, 
and count for each fourth leaf-label, x, how many of the quartets, {i,j, k, x}, have the same topology 
in the two trees; whether this be resolved topologies, ix\jk, ij\xk or ik\jx counting A, or unresolved, 
(ijkx) counting E. Counting this way, each shared quartet topology will be counted twelve times; so, 
the algorithm actually computes 12 ■ (A + E), and we must divide by twelve at the end. We explicitly 
iterate through all 0{n^) triplets, and the crux of the algorithm is counting the shared quartet topologies 
in constant time. 

The approach is as follows: For each triplet, {i,j, k}, we identify the "center", C, of the induced 
topology in both trees. Let Fj, Fj and F^ be the trees in Ti connected to C containing leaves i, j and k, 
respectively, and let Gi, Gj and Gk be the corresponding trees in T2. A resolved quartet, ix\jk, is shared 
between the two trees if x is in Fj and in Gi (see Figure 7), so the number of such quartets is |Fj fl Gj | — 1 
(minus one because we otherwise would also count ii\jk). Therefore, given i, j and k, the number of 
shared resolved quartets is: 

\Fi n Gil + \Fj nGj\ + \Fk n Gfel - 3 

which can be computed in constant time from the tables we build in preprocessing using dynamic 
programming, assuming we know the centers for the triplet. 
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Figure 7. Computing quartets between high-degree nodes. 





Now, let F^ijk denote all the leaves in Ti, except those in Fj, Fj and Fk, and similarly, let G-ijk 
denote the set of leaves in T2 not in G,, Gj and Gk- The number of unresolved quartets [ijkx] is then 
given by \F_ijk fl G-jjfel. We cannot directly build a table of all \F^ijk fl G-i^fel in O(n^), since just 
from the number of indices (z, j and k), we would need a table of size 0(r2'^). Instead, we can compute: 

where: 

= n - (|G,| + \Gj\ + 
(we can tabulate \Gi\ for all £ in 0(n) in preprocessing) and: 

\G.ijk n F,| = \Fi\ - {\F, n Gil + |F, n G,-| + |F, n G^l) 

for i = k; so, we can also count how many unresolved topologies we have for (ijkx) if we have 
built tables in preprocessing and we know the center nodes. 

To get the running time of 0{n^), we thus only need to find the center nodes in constant time for each 
triplet, {i,j, k}. We achieve this by finding a linear number of centers, in linear time, for a linear number 
of triplets. For pairs, j}, we iterate through all k and their corresponding centers in linear time. 

The idea is as follows: for each pair, i and j, we identify the path between them, which is easily done 
in 0{n). Along this path, we then consider all inner nodes and the trees branching off. We can explicitly 
iterate through all k in such a subtree, and the corresponding center node is the node where the subtree 
branches off the path from i to j (see Figure 8). This way, for each pair, {i,j} — of which there are 
0{n^) — we iterate through leaves k in 0{n) and count the number of shared quartets, {i, j, k, x}, giving 
us a total running time of 0{n^). 

The second algorithm in Christiansen et al. [20] completely avoids dealing with unresolved quartet 
topologies by only counting the resolved topologies that are either shared. A, or that differ, B, between 
the two trees . Recall that the quartet distance between trees Ti and T2 are given by -B + G + in Figure 2. 
If we have functions A(Ti,T2) and B(Ti,T2) that compute A and B for the two trees, respectively, 
then we can compute the quartet distance without considering unresolved topologies explicitly, since 
A + B + G = A{Ti, Ti),A + B + D = A(T2, T2), and: 

qdist(Ti,T2) = B + G + D 

= {A + B + G) + {A + B + D)-2-A-B 

= A(Ti, Ti) + A(T2, T2) - 2 ■ A(Ti, T2) - B(Ti, T2) 
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Computing the row and column sums, A + B + C and A + B + D, can also be done faster than using 
the A function; in fact, it can be computed in linear time for both triplets and quartets using dynamic 
programming (see either Christiansen et al. [21] or Brodal et al. [5] for details). 

Figure 8. Handhng all trees hanging off the path from iio j. 




Should one want to compute the parameterized quartet distance instead, it can be done a little more 
cumbersomely, but still using only the A and B counts: 

pqdist(Ti,T2,p) = B+p-{C + D) 

= B+p-[{A + B + C) + {A + B + D)-2-A-2-B] 

= B(Ti,T2) + p- [A(Ti,Ti) + A(T2,T2) - 2 • A(Ti,T2) - 2 • B(Ti,T2)] 

although that is not likely to be an efficient way of doing this. We will not consider it in more detail for 
now, however. 

The second algorithm from Christiansen et al. [20] is based on counting the resolved quartets, whether 
case A or B, using oriented edges, Fi A (F2, F3), and the quartets they claim. Consider oriented edges 
ei and 62 in Ti and T2, respectively, and let f 1 and f 2 denote the destination nodes of the edges. If f 1 
or t>2 are high-degree nodes, then the edges do not correspond to unique claims, Fi —t- (F2, F3) and 
Gi -A (G2, G3), since the trees, F2, F3, G2 and G3, are not uniquely defined (see Figure 9a). To get 
around this, the algorithm transforms the trees by expanding (arbitrarily) each high-degree node into 
a binary tree, tagging edges, so we can distinguish between original edges and edges caused by the 

expansion (see Figure 9b). We then define extended claims, Fi > (F2, F3), consisting of two edges, 

ei and e[, such that ei is one of the original edges, e[ is a result of expanding nodes and such that the 
oriented path from ei to e[ only goes through expansion edges. The extended claims play the role that 
claims do in the original algorithm, and each resolved quartet in the original tree is claimed by exactly 
two extended claims in the expanded tree. 

The algorithm then implements functions A and B by explicitly iterating through all pairs of extended 
claims. Each of the original 0{n) edges are expanded into at most 0{nd) extended claims, where d is 
the maximal degree of the trees, and by counting the equal or different quartet topologies in constant 
time for each pair of extended claims, the total running time is 0{n^d'^). 
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Figure 9. Expanding high-degree nodes. 




(a) The edge ei is not part of a unique Fi,Fi,Fj can be claimed by a unique 

claim, but part of all claims Fi extended claim Fi ''^''"> {Fi,Fj) for all 

{Fi, Fj) for z, j = 2, 3, 4, 5, ^ 7^ j. j = 2, 3, 4, 5, i / j. 



The A function counts the number of equal topologies from the table of |F fl G| counts exactly as the 
algorithm for binary trees: 

A(ei ■ ■ ■ e;, 62 ■ ■ ■ e'2) = (^'^' 2 ^'') (1^2 n Gal ■ IF3 n Ggl + n G3I ■ IF3 n Gsl) 

and: 

A(Ti,T2) = ^ E A(ei---e;,e2---e'2) 

The B function, counting the number of resolved quartets, {i, j, k, i}, with different topologies, also 
uses the tables, but with a slightly different expression. Since we have resolved, but different, quartet 
topologies whenever Fi !l> (^2,^3) claims ij — )■ ki (i.e., i,j G Fi, A; G F2 and £ G F3) and 
Gi )• {G2, G-i) claims ik — > ji, ii jk, jk — )■ i£ or j£ — )• ik, we can count as: 



B{ei ■ ■ ■ e[, 62 ■ ■ ■ e'2) 


= \Fi 


nGil 


iFiDG^l 


|i^2nG2| 




+ 




\Fi 


nCil 




|F2nG3| 


\FsnG,\ 


+ 




\Fi 


nCil 


|FinG'2| 


|F2nGi| 


\F,nGs\ 


+ 




\Fi 


nCil 


|i^inG'3| 


|F2nGi| 


|F3nG2| 





Because of symmetries and counting each quartet in two claims in each tree, this will over-count by 
a factor of four, so the number of resolved, but different, quartet topologies between the two trees is 
counted as: 

^iTuT2) = \ E B(ei---e;,e2---e'2) 

ei-'-e'^eTi e2---eJ,eT2 

Christiansen et al. [21] improved the running time to 0{n'^d) by changing the counting schemes for 
the A and B functions. The gist of the ideas in the improved counting scheme is to extend the table of 
shared leaf counts, \F fl G\, with values \F fl G\, \F fl G\ and \F fl G\, where F denotes the set of 
leaves not in subtree F. Using these extra counts, the algorithm builds additional tables, over-counting 



Biology 2013, 2 



1203 



the number of quartets claimed and, then, adjusting the over-count. For details on this, rather involved, 
counting scheme, we refer to the original paper [21]. 

Using the same basic idea, but with yet another counting scheme and another set of tables counting 
sets of shared leaves, Nielsen et al. [22] developed an 0(n^+") algorithm, where a = (cu — l)/2 
and 0{n'^) is the time it takes to multiply two n x n matrices. Using the Coppersmith- Winograd 
algorithm [23], where u = 2.376, this yields a running time of 0{n'^^^^^) and was the first guaranteed 
sub-cubic time algorithm for computing the quartet distance between general trees. Here, the underlying 
idea is to reduce an explicit iteration over 0{(f) pairs of claims to a matrix-matrix multiplication as part 
of the counting iteration. Again, we refer to the original paper for details [22]. 

4.2. Tree Coloring 

The coloring approach of Brodal et al. [14,15] for binary trees, described in Section 3.2, has been 
extended to general trees, first by Stissing et al. [24] and recently by Brodal et al. [5]. 

The result of [24] is an 0{cPn \ogn) time algorithm for computing the quartet distance between two 
trees, Ti and T2, where d = max{di, c?2} for di, the maximal degree of a node in tree Tj. This was the 
first algorithm for general trees allowing for sub-quadratic time. The approach of [24] stays rather close 
to that of [14,15], but with a decoration scheme adapted to general trees. The dependency, d^,ond stems 
from the HDT data structure having a d factor in its balance bound when applied to general trees and 
from the applied decoration scheme for the HDT requiring 0{d^) time for the update of the decoration 
of a node based on its children after a color change. 

The results of [5] are an O(nlogn) time algorithm for computing the triplet distance and an 
0{dn log n) time algorithm for computing the quartet distance, both for general trees, with d defined as 
above. The improved dependency on d stems from a new definition of the HDT data structure achieving 
good balance for general trees and from a much more elaborate decoration scheme, with a large system 
of auxiliary counters assisting in the calculation of the main counters. 

Johansen and Holt [25] have later shown how to improve the algorithm, such that d = mm{di, ^2}, 
which is of significance if only one of the trees has a high degree. 

The new HDT definition is based on the four types of components shown in Figure 10. Here, the 
base components, constituting the leaves of the HDT, are L components containing a leaf from T2 and 
/ components containing a single internal node from T2. The internal nodes of the HDT are formed by 
C components containing a connected set of nodes with at most two external edges to other components 
and G components containing the subtrees of some of the siblings of a node in T2. During construction of 
the HDT, C and G components are created by the transformations and compositions shown in Figure 1 1 . 
The construction proceeds in rounds, with each round performing a set of non-overlapping compositions. 
It is shown in [5] that in 0{logn) rounds and 0{n) time, a 1/ log(10/9)-locally balanced HDT tree is 
formed. This plays the role of the HDT described in Section 3.2, but now, for general trees. 
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Figure 10. The four different types of components. 



L 





G 



Figure 11. The two types of transformations (top) and the three types of compositions 
(bottom). 




CC^C 



C L 




C^G 





GG^G 



IG^C 



Additional changes include the use oid+1 colors, denoted 0, 1,2,... ,d, and new definitions of which 
node a triplet/quartet is associated with. The node in question is denoted the anchor, and the choice of 
anchors can be seen in Figure 12. This choice forms the basis for the definition of the decoration system, 
which, as said, is quite involved and for which we refer to [5] for details (as well as to [25] for minor 
corrections and for a trimmed variant of the system). 

Figure 12. The anchors (white nodes) of resolved and unresolved triplets and quartets. Edges 
in the figures represents paths in the tree. 

Triplets 



Resolved 



Unresolved 

A 

I j k 



Quartets 



Resolved 
Type a Type /? Type 7 



i j k I i j 



Unresolved 



i j k £ 



i j k 



The remaining parts follow the lines of [14,15] (see Section 3.2). The basic algorithm, corresponding 
to Figure 4 for binary trees, for traversing Ti recursively can be seen in Figure 13. It maintains the 
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following invariants: (1) When entering a node, v, all leaves in the subtree of v have the color, one, 
and all leaves not in the subtree of v have the color, zero; (2) When exiting v, all leaves in Ti have 
the color, zero. To initiahze Invariant (1), all leaves are colored one at the start. As in Section 3.2, the 
operations. Contract and Extract, are then added (analogously to Figure 6) for exploiting a variant of the 
"extended smaller-half trick", in order to arrive at the stated running times. 



Figure 13. The main algorithm performing a recursive traversal of Ti. 



COUNT(t)) 




if f is a leaf 




Color V by the color 0. 




else 




Let ci be the child of v with largest subtree, and let C2, . . 


. Cfc be its remaining children. 


for i = 2 to k 




Color the leaves in the subtree of Ci by the color i. 




1 1 Leaves are now colored according to v 




Query the HDT for the number of triplets/quartets in T2 


compatible with the coloring. 


Add that number to the global count. 




for i = 2 to k 




Color the leaves in the subtree of Ci by the color 0. 




Count(ci) 




for i = 2 to k 




Color the leaves in the subtree of Ci by the color 1. 




COUNT(ci) 





5. Experimental Results 

Most of the algorithms we have described in this review paper have been implemented in different 
software tools, and in this section, we experimentally compare their runtime performance. All 
experiments involved compare randomly generated balanced trees. We note that two randomly generated 
trees are expected to have a large distance, which influences the running time for the coloring algorithms. 
Similar trees require, overall, less updating in the hierarchical decomposition, so random trees are 
a worst-case situation for these. Experiments, not shown here, demonstrate that comparing similar 
trees can be significantly faster [25]. The shape of the trees also affects the running time for the 
coloring algorithms through the "smaller-half trick", with faster running times for very skewed trees 
(for perfectly balanced trees, the number of color changes required is 6(nlogn), while for the other 
extreme, caterpillar trees, the number is Q{n)). 

All experiments in this section were conducted on an Ubuntu Linux Server 12.04, 3.4 GHz 64-bit 
Intel Core 17-3770 (quad-core) with 32 GB of RAM. 

For the quartet distance between binary trees, we performed experiments with three algorithms, 
the 0{n\og^n) time algorithm implemented in QDist [14,26], the 0(n^-^^^) time algorithms from 
Nielsen et al. [22] and the 0{dn log n) time algorithm from Brodal et al. [5]. For the quartet distance 
between non-binary trees, we considered trees with a degree of eight and 128, using only the last two 
of the three algorithms, since the QDist algorithm only works on binary trees. Note, however, that 



Biology 2013, 2 



1206 



the implementation from [5] is not strictly the variation described in the paper. It contains algorithmic 
optimizations improving both the asymptotic and practical runtime as described in [25]. 

Figure 14 shows the runtimes of the three implementations of the quartet distance calculation 
algorithms running on binary trees. All three implementations can operate on trees of a size of up 
to 10,000; only the algorithm from [5] is shown for up to one million leaves. For inputs of less 
than approximately 4,000 leaves, the implementation of the 0(n^-^^^) time algorithm is faster than 
the implementation of the 0{n\o^ n) time algorithm. For all input sizes depicted, however, the 
implementation of the 0{dn\ogn) time algorithm is the fastest, computing the distance between two 
trees with one million leaves in well under two minutes. 



Figure 14. Quartet distance running time on binary trees. 




10^ 10-' 10" 10" 10'' 10*^ 

Number of leaves Number of leaves 



(a) (b) 



Figure 15a compares the two general quartet distance algorithms on trees with a degree of eight, while 
Figure 15b compares the algorithms for trees with a degree of 128. Both algorithms are faster on these 
higher-degree trees than on binary trees. The 0{'n?'^^^) time algorithm is faster on small trees (where the 
exact point where the other algorithm is faster depends on the degree). 



Figure 15. Quartet distance running time on non-binary balanced trees. 




10^ 10-' 10" 10^ 10' 10" 

Number of leaves Number of leaves 



(a) Trees with degree d = 8. (b) Trees with degree d = 128. 
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We note, however, that the runtime of the algorithm from [5], for d somewhere between 256 and 512, 
in fact, increases beyond that of the binary case. For c/ = 1, 024, the runtime of the algorithm has almost 
doubled compared to the binary case, and for d = 2, 048, the runtime has more than tripled compared to 
the binary case (results not shown). 

For the triplet distance between binary trees, we compared the general algorithm from [5] with the 
algorithm from [16] for binary trees. For the general algorithm, we also performed experiments on 
trees with a degree of 8 and 128. Figure 16 shows the measured running times. These experiments 
are summarized in Figure 16. For binary trees, the O(nlog^n) time algorithm [16] is faster than the 
0{n log n) time algorithm [5] for all measured sizes. For the general algorithm, we again observe that it 
performs faster on high-degree trees (where, since the number of leaves is fixed, we have fewer internal 
nodes to handle). 

Figure 16. Triplet distance running time. For the 0{n\og^ n) time algorithm, which only 
handles binary trees, results are only shown for a degree of = 2, while for the general 
algorithm, results are also shown for c? = 8 and d = 128. 




10^ 10=* 10* 10^ 10^ 

Number of leaves 



6. Conclusions 

We have presented a series of algorithmic improvements for computing the triplet and quartet distance 
between two general trees that we have developed over the last decade. Our work has followed two main 
approaches, one based on counting shared topologies using tables of the intersections of subtrees and one 
based on coloring labels and counting compatible topologies using a hierarchical decomposition data 
structure. The second approach has resulted in the currently best worst-case running time of 0{n \ogn) 
for computing the triplet distance and 0{dn log n) for computing the quartet distance. Whether this can 
be improved further is currently unknown. 

While the theoretical fastest algorithms involve rather complex bookkeeping for counting topologies, 
we have shown that they can be implemented to be efficient in practice, as well, computing the distance 
between two trees with a million leaves in a few minutes. With more typical phylogenetic tree sizes, 
with the number of leaves in the hundreds or low thousands, the distance can be computed in less than 
a second. 
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